A Study of Text Preprocessing Tools for Arabic Text Categorization
نویسندگان
چکیده
Text preprocessing is an essential stage in text categorization (TC) particularly and text mining generally. Morphological tools can be used in text preprocessing to reduce multiple forms of the word to one form. There has been a debate among researchers about the benefits of using morphological tools in TC. Studies in the English language illustrated that performing stemming during the preprocessing stage degrades the performance slightly. However, they have a great impact on reducing the memory requirement and storage resources needed. The effect of the preprocessing tools on Arabic text categorization is an area of research. This work provides an evaluation study of several morphological tools for Arabic Text Categorization. The study includes using the raw text, the stemmed text, and the root text. The stemmed and root text are obtained using two different preprocessing tools. The results illustrated that using light stemmer combined with a good performing feature selection method enhances the performance of Arabic Text Categorization especially for small threshold values.
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملArabic Text Categorization using Machine Learning Approaches
Arabic Text categorization is considered one of the severe problems in classification using machine learning algorithms. Achieving high accuracy in Arabic text categorization depends on the preprocessing techniques used to prepare the data set. Thus, in this paper, an investigation of the impact of the preprocessing methods concerning the performance of three machine learning algorithms, namely...
متن کاملHigh capacity steganography tool for Arabic text using 'Kashida'
Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...
متن کاملImproving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کاملDocument Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کامل